strip_tags via regex

  • strip_tags via regex

    Hi, I am having issues with a CMS that uses a Rich Text Editor: users who cut and paste snippets from Word documents etc. bring LOADS of MS rubbish with them, in the form of MS-only attributes and CSS, and often smart tags as well.

    I am using strip_tags at the moment, but I still end up with unwanted attributes on the allowed tags. <img> and <a> (anchor) tags need to keep their attributes; every other attribute must go. Any takers?
    resistance is...

    MVC is the current buzz in web application architectures. It comes from event-driven desktop application design and doesn't fit into web application design very well. But luckily nobody really knows what MVC means, so we can call our presentation layer separation mechanism MVC and move on. (Rasmus Lerdorf)

  • #2
    I suppose I would take the easy road and just build an array of the MS 'things' I wanted to get rid of, and then replace them with nothing. Like ...
    PHP Code:
    // Provides: Hll Wrld f PHP
    $vowels = array("a", "e", "i", "o", "u", "A", "E", "I", "O", "U");
    $onlyconsonants = str_replace($vowels, "", "Hello World of PHP");
    from http://be.php.net/str_replace

    ... but then replace the letters with the unwanted tags.

    I don't know what sort of MS stuff you want to get rid of, but if they are tags like <billyboy>, then you can't use regex, since you can't search on a pattern (unless you make the regex so long that it covers all the tags you actually want to keep).

    So the simple str_replace might take you 10 minutes of typing, but I somehow get the impression that setting up a regex will take even longer and will be slower at runtime.
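
For what it's worth, the blacklist idea above could look roughly like this. It is only a sketch: the entries in $ms_junk and the helper name strip_ms_junk() are made up for illustration, and the list would have to be filled in with whatever Word actually leaves behind in your content.

```php
<?php
// Hypothetical list of Word leftovers; extend it with the fragments
// that really show up in the pasted markup.
$ms_junk = array('<o:p>', '</o:p>', ' class="MsoNormal"', ' class=MsoNormal');

function strip_ms_junk($html, array $junk)
{
    // str_ireplace() is the case-insensitive variant of str_replace() (PHP 5+).
    // Every entry in $junk is replaced with the empty string.
    return str_ireplace($junk, '', $html);
}

echo strip_ms_junk('<p class="MsoNormal">Hello<o:p></o:p></p>', $ms_junk);
// prints: <p>Hello</p>
```

The obvious limit is that only fragments already in the list are removed; anything Word invents tomorrow passes straight through.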
    Last edited by raf; Feb 27, 2004, 09:42 AM.
    Posting guidelines I use to see if I will spend time to answer your question : http://www.catb.org/~esr/faqs/smart-questions.html


    • #3
      Hm, I just wanted to give you a pattern, but I'm not sure what exactly you want.

      Could you please give an example?
      What do you want to strip?
      What do you want to keep?
      And why isn't strip_tags working?
      www.united-scripts.com
      www.codebattles.org


      • #4
        Not sure which text editor you are using, but for htmlArea, which seems to be popular, someone has been putting together a Word cleaner that is supposed to help clean out the junk code that MS Word puts in.

        Spookster
        CodingForum Supreme Overlord
        All Hail Spookster


        • #5
          Hi Spooks, indeed it is htmlArea, which has solved a lot of my RTE issues as it is cross-browser and cross-platform.
          That link was really useful, cheers ... though I still want to do a server-side check as well.

          Raf, speed of execution is not really an issue here, as this is only used in administration pages and not on general page views; I don't use regex in page views unless I absolutely have to.

          Piz, strip_tags does not work, as it only strips known HTML tags, so MS kak like <citylace>Perth etc. gets ignored. Just as important, I want to allow some tags (<b>, <ul> etc.) but strip any attributes from the allowable tags, except for anchors and images.

          So the general plan is: strip ALL tags unless allowable, and for the allowable tags strip all attributes unless the tag is a specified type (img, a etc.). I am currently trying to do the above one step at a time, but would appreciate any pointers.
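
The two-step plan above can be sketched as a single preg_replace_callback() pass. This is a rough illustration under assumptions, not a finished cleaner: the $whitelist contents and the function names clean_tag() and rte_clean_sketch() are invented for the example, and it assumes attribute values never contain a literal ">".

```php
<?php
// Hypothetical whitelist: tag name => attributes that tag may keep.
// An empty array means "keep the tag, drop all of its attributes".
$whitelist = array(
    'b' => array(), 'i' => array(), 'p' => array(),
    'ul' => array(), 'li' => array(), 'br' => array(), 'hr' => array(),
    'img' => array('src', 'alt', 'class'),
    'a'   => array('href', 'target', 'name', 'class'),
);

// Rebuilds one matched tag. $m[1] is "/" for a closing tag,
// $m[2] the tag name, $m[3] the raw attribute string.
function clean_tag(array $m, array $whitelist)
{
    $tag = strtolower($m[2]);

    // Step 1: a tag that is not whitelisted is dropped (its text content stays).
    if (!isset($whitelist[$tag])) {
        return '';
    }
    if ($m[1] === '/') {
        return '</' . $tag . '>';
    }

    // Step 2: keep only the attributes allowed for this tag.
    $attrs = '';
    if ($whitelist[$tag]) {
        // Accepts name="v", name='v' and unquoted name=v; the (?| ... ) branch
        // reset puts the value in group 2 whichever form matched.
        $pattern = '/([a-z][a-z0-9:-]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
        preg_match_all($pattern, $m[3], $found, PREG_SET_ORDER);
        foreach ($found as $f) {
            $name = strtolower($f[1]);
            if (in_array($name, $whitelist[$tag], true)) {
                // Re-quote the value; existing entities are not double-encoded.
                $value = htmlspecialchars($f[2], ENT_QUOTES, 'UTF-8', false);
                $attrs .= ' ' . $name . '="' . $value . '"';
            }
        }
    }

    $void = array('br', 'hr', 'img');
    return '<' . $tag . $attrs . (in_array($tag, $void, true) ? ' />' : '>');
}

function rte_clean_sketch($html, array $whitelist)
{
    // Tag names must start with a letter, so HTML comments and XML-style
    // pseudo tags are not matched here and would need a separate pass.
    return preg_replace_callback(
        '/<(\/?)([a-z][a-z0-9:]*)([^>]*)>/i',
        function (array $m) use ($whitelist) {
            return clean_tag($m, $whitelist);
        },
        $html
    );
}

$dirty = '<p class=MsoNormal style="margin:0">Hi <st1:City>Perth</st1:City> '
       . '<a href="x.html" onclick="evil()">link</a><br><img src=\'a.png\' width=10></p>';
echo rte_clean_sketch($dirty, $whitelist);
// prints: <p>Hi Perth <a href="x.html">link</a><br /><img src="a.png" /></p>
```

Keeping the allowed attributes in the same table as the allowed tags means there is one place to edit when a new tag or attribute has to be let through.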

          • #6
            OK, after stealing some regex from the manual user notes, this is where I am up to so far. It seems to kill all the disallowed tags (though I am still having issues with MS <?xml:namespaces/>) and lets the allowable tags through with their attributes stripped. I am still working on the allowable attributes in the matches callback (it currently finds the allowable tags but as yet does nothing with them). Any help or issues noted appreciated!

            PHP Code:
            <?php
            // Callback for preg_replace_callback():
            // $array[1] is the tag name, $array[2] the raw attribute string.
            function matches( $array ){
                //allowable attributes for given tag names
                $src_attr = array(
                    'img'=>array( 'src' , 'border' , 'class' ) ,
                    'a'=>array( 'href' , 'target' , 'name' , 'class' ) ) ;

                //initialise so the return below never uses an undefined variable
                $atrs = '' ;

                //grab allowable attributes
                if( !empty( $array[2] ) && in_array( $array[1] , array_keys( $src_attr ) ) ){
                    //???//preg_match( "/(.*)=\"?(.*)\"?/u",$array[2],$bits) ;
                    //debug output only: the attributes are found but not yet filtered
                    echo $array[2];
                }
                //make br and hr self-closing
                $array[1] = str_replace(
                    array( 'br' , 'hr' ) ,
                    array( 'br /' , 'hr /' ) ,
                    $array[1] ) ;
                return '<' . $array[1] . $atrs . '>' ;
            }

            function rte_clean($str) {
                $allowed = "img|br|b|i|p|u|a|center|hr|ul|li|h3|h4|h5|hr";
                //lose any tags unless in $allowed//
                $str = preg_replace("/<((?!\/?($allowed)\b)[^>]*>)/xis", '' , $str);
                //callback for allowable attributes in allowable tags
                $str = preg_replace_callback("/<($allowed)(.*?)>/i", 'matches', $str);
                return $str;
            }

            echo stripslashes( rte_clean( $str ) ) ;
            ?>
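
For the <?xml:namespace ... /> leftovers mentioned above, one option is a separate pass before the tag filtering. The following is a minimal sketch; strip_xml_pseudo_tags() is a made-up name, and it assumes these pseudo tags never contain a ">" before their end.

```php
<?php
function strip_xml_pseudo_tags($html)
{
    // Removes everything from an opening "<?xml" through the next ">",
    // which covers declarations such as Word's xml:namespace pseudo tag.
    return preg_replace('/<\?xml[^>]*>/i', '', $html);
}

$pasted = '<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><p>Hi</p>';
echo strip_xml_pseudo_tags($pasted);
// prints: <p>Hi</p>
```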

            • #7
              OK, latest version, using some regex Mordred gave me some time ago for the attributes (which I have messed around with, so it is less flexible than it originally was).

              Mostly working, though <a> tags are currently getting ignored. Any improvements appreciated.

              PHP Code:
              <?php
              // Callback for preg_replace_callback():
              // $array[1] is the tag name, $array[2] the raw attribute string.
              function htmlarea_clean_matches( $array ){
                  $array[1] = trim( $array[1] ) ;
                  //allowable attributes for given tag names
                  $src_attr = array(
                      'img'=>array( 'alt' , 'title' , 'src' , 'border' , 'class' ) ,
                      'a'=>array( 'href' , 'target' , 'name' , 'class' ) ) ;
                  $self_closers = array( 'br' , 'hr' , 'img' , 'input' ) ;
                  $end = ( in_array( $array[1] , $self_closers ) ) ? ' />' : '>' ;

                  //initialise so tags without kept attributes do not hit an undefined variable
                  $atrs = '' ;

                  //grab allowable attributes
                  if( !empty( $array[2] ) && in_array( $array[1] , array_keys( $src_attr ) ) ){
                      preg_match_all( "/(\S+?)(?:\s*)=(?:\s*)(?:\")([^>\s]+)(?:\")/Uxis" , $array[2] , $bits ) ;
                      if( !empty( $bits[0] ) ){
                          $x = 0 ;
                          while( $x < count( $bits[0] ) ){
                              //keep the attribute only if it is listed for this tag
                              if( in_array( $bits[1][$x] , $src_attr[$array[1]] ) ){
                                  $atrs .= ' ' . $bits[1][$x] . '="' . $bits[2][$x] . '"';
                              }
                              ++$x;
                          }
                      }
                  }
                  return '<' . $array[1] . $atrs . $end ;
              }

              function htmlarea_clean($str) {
                  $allowed = 'strong|img|br|b|i|p|u|a|center|hr|ul|li|h3|h4|h5|hr' ;
                  //lose any tags unless in $allowed//
                  $str = preg_replace( "/<((?!\/?($allowed)\b)[^>]*>)/xis" , '' , $str ) ;
                  //callback for allowable attributes in allowable tags
                  $str = preg_replace_callback( "/<($allowed)(.*?)>/i" , 'htmlarea_clean_matches' , $str ) ;
                  //escape for the database insert (the mysql_* functions were removed in PHP 7)
                  return mysql_escape_string( $str );
              }
              ?>